Entity Consolidation: The Golden Record Problem
Authors
Abstract
Four key subprocesses in data integration are: data preparation (i.e., transforming and cleaning data), schema integration (i.e., lining up like attributes), entity resolution (i.e., finding clusters of records that represent the same entity) and entity consolidation (i.e., merging each cluster into a “golden record” that contains the canonical value for each attribute). In real scenarios, the output of entity resolution typically contains multiple data formats and different abbreviations for cell values, in addition to the omnipresent problem of missing data. These issues make entity consolidation challenging. In this paper, we study the entity consolidation problem. Truth discovery systems can be used to solve this problem. They usually employ simplistic heuristics such as majority consensus (MC) or source authority to determine the golden record. However, these techniques cannot recognize simple data variation, such as Jeff ↔ Jeffery, and may give biased results. To address this issue, we propose to first reduce attribute variation by merging duplicate values before applying the truth discovery system to create the golden records. For this purpose, we first align the attribute values within the same cluster to generate candidate matchings (pairs of substrings that could replace each other, e.g., 9th ↔ 9 and Jeff ↔ Jeffery). Then we aggregate candidate matchings with common characteristics into groups. Finally, we ask a human to validate these matching groups and apply the approved ones to merge duplicate values. Compared with existing data transformation solutions, which typically try to transform an entire column from one format to another, our approach is more robust to data variety because we leverage the hidden matchings within the clusters. We tried our methods on three real-world datasets.
In the best case, our methods reduced the variation in clusters by 75% with high precision (>98%) by having a human confirm only 100 generated matching groups. When we invoked our algorithm prior to running MC, we were able to improve the precision of golden record creation by 40%.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 10, No. 6. Copyright 2017 VLDB Endowment 2150-8097/17/02.
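The pipeline outlined in the abstract (generate candidate matchings inside a cluster, have a human approve some of them, merge the variants, then run majority consensus) can be illustrated with a small Python sketch. This is not the paper's algorithm: the token-level alignment below is a simplified stand-in for its substring alignment, the grouping of matchings is omitted, and the function names, the sample cluster, and the `approved` mapping are all hypothetical.

```python
from collections import Counter


def majority_consensus(values):
    """Return the most frequent non-missing value in a cluster, or None."""
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0] if counts else None


def candidate_matchings(values):
    """Simplified stand-in for alignment: for two values with the same
    number of tokens that differ in exactly one position, emit the
    differing token pair as a candidate matching."""
    pairs = set()
    distinct = sorted({v for v in values if v})
    for i, a in enumerate(distinct):
        for b in distinct[i + 1:]:
            ta, tb = a.split(), b.split()
            if len(ta) != len(tb):
                continue
            diff = [(x, y) for x, y in zip(ta, tb) if x != y]
            if len(diff) == 1:
                pairs.add(diff[0])
    return pairs


def apply_matchings(values, approved):
    """Rewrite tokens using a human-approved {variant: canonical} map."""
    return [
        " ".join(approved.get(tok, tok) for tok in v.split()) if v else v
        for v in values
    ]


# Hypothetical cluster for one attribute; None marks a missing value.
cluster = [
    "9th Oak St", "9th Oak St", "9 Oak St", "9 Oak St",
    "90 Oak St", "90 Oak St", "90 Oak St", None,
]

before = majority_consensus(cluster)
candidates = candidate_matchings(cluster)

# Assumption: the reviewer approves only 9th -> 9 and rejects the
# candidates that involve 90, which is a genuinely different value.
approved = {"9th": "9"}
after = majority_consensus(apply_matchings(cluster, approved))

print(before)
print(sorted(candidates))
print(after)
```

In this toy cluster the two spellings of the same address split their votes, so plain majority consensus picks the conflicting value; once the approved matching merges them, the merged value wins. The human check matters because the candidate set also contains pairs that should not be merged.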
Similar resources
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or not-linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
Federated Entity Search Using On-the-Fly Consolidation
Nowadays, search on the Web goes beyond the retrieval of textual Web sites and increasingly takes advantage of the growing amount of structured data. Of particular interest is entity search, where the units of retrieval are structured entities instead of textual documents. These entities reside in different sources, which may provide only limited information about their content and are therefor...
Addressing the Freight Consolidation and Containerization Problem by Recent and Hybridized Meta-heuristic Algorithms
Nowadays, in the global free market, third-party logistics providers (3PLs) are becoming increasingly important. Hence, this study aims to develop the freight consolidation and containerization problem, which consists of loading items into containers and then shipping these containers to different warehouses, from which they are delivered to their final destinations. In order to handle the proposed problem, thi...
MOMA - A Mapping-based Object Matching System
Object matching or object consolidation is a crucial task for data integration and data cleaning. It addresses the problem of identifying object instances in data sources referring to the same real world entity. We propose a flexible framework called MOMA for mapping based object matching. It allows the construction of match workflows combining the results of several matcher algorithms on both ...
Journal title:
- CoRR
Volume abs/1709.10436 Issue
Pages -
Publication date 2017